OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Automatic Indexing of Newspaper Microfilm Images

Internal identifier: 001934 (Main/Exploration); previous: 001933; next: 001935

Authors: Hong Liu [Singapore]; Lim Tan [Singapore]

Source: Lecture Notes in Computer Science [0302-9743]; 2002

RBID : ISTEX:F22D67DB06336CE708F623A72F71D3ACF9D0BD3D

Abstract

This paper describes a proposed document analysis system for automatically indexing digitized images of old newspaper microfilms. The system extracts news headlines from the microfilm images and converts them to machine-readable text by OCR, so that the headlines serve as indices to the respective news articles. A major challenge is the poor image quality of the microfilm: most images are inadequately illuminated and considerably dirty. Because conventional threshold selection techniques are inadequate for such images, we propose a new method for separating characters from the noisy background. A Run Length Smearing Algorithm (RLSA) is then applied to extract the headlines. Experimental results confirm the validity of the approach.
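
The abstract names the Run Length Smearing Algorithm without giving details. As a rough illustration only, the following is a minimal sketch of the generic textbook form of RLSA on a binary image (1 = ink, 0 = background), not the authors' implementation. The function names, the threshold parameters, and the choice to fill only gaps bounded by ink on both sides are assumptions made for this example.

```python
def smear_row(row, threshold):
    """Fill background gaps of at most `threshold` pixels lying between two ink pixels."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, value in enumerate(row):
        if value == 1:
            if last_ink is not None:
                gap = i - last_ink - 1
                if 0 < gap <= threshold:
                    for j in range(last_ink + 1, i):
                        out[j] = 1
            last_ink = i
    return out


def rlsa_horizontal(image, threshold):
    """Smear every row of the image independently."""
    return [smear_row(row, threshold) for row in image]


def rlsa_vertical(image, threshold):
    """Smear every column by transposing, smearing rows, and transposing back."""
    columns = [list(col) for col in zip(*image)]
    smeared = [smear_row(col, threshold) for col in columns]
    return [list(row) for row in zip(*smeared)]


def rlsa(image, h_threshold, v_threshold):
    """Combine the horizontal and vertical smearing results with a pixelwise AND."""
    horizontal = rlsa_horizontal(image, h_threshold)
    vertical = rlsa_vertical(image, v_threshold)
    return [[a & b for a, b in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]
```

In a layout analysis setting, smearing of this kind merges the characters of a text line into one connected block, and large blocks are then candidate headlines. How the paper sets its thresholds and selects headline blocks is not stated in this record.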

Url: https://api.istex.fr/document/F22D67DB06336CE708F623A72F71D3ACF9D0BD3D/fulltext/pdf
DOI: 10.1007/3-540-45869-7_41

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Automatic Indexing of Newspaper Microfilm Images</title>
<author>
<name sortKey="Liu, Hong" sort="Liu, Hong" uniqKey="Liu H" first="Hong" last="Liu">Hong Liu</name>
</author>
<author>
<name sortKey="Tan, Lim" sort="Tan, Lim" uniqKey="Tan L" first="Lim" last="Tan">Lim Tan</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:F22D67DB06336CE708F623A72F71D3ACF9D0BD3D</idno>
<date when="2002" year="2002">2002</date>
<idno type="doi">10.1007/3-540-45869-7_41</idno>
<idno type="url">https://api.istex.fr/document/F22D67DB06336CE708F623A72F71D3ACF9D0BD3D/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000B43</idno>
<idno type="wicri:Area/Istex/Curation">000B28</idno>
<idno type="wicri:Area/Istex/Checkpoint">001048</idno>
<idno type="wicri:doubleKey">0302-9743:2002:Liu H:automatic:indexing:of</idno>
<idno type="wicri:Area/Main/Merge">001A14</idno>
<idno type="wicri:Area/Main/Curation">001934</idno>
<idno type="wicri:Area/Main/Exploration">001934</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Automatic Indexing of Newspaper Microfilm Images</title>
<author>
<name sortKey="Liu, Hong" sort="Liu, Hong" uniqKey="Liu H" first="Hong" last="Liu">Hong Liu</name>
<affiliation wicri:level="4">
<country xml:lang="fr">Singapour</country>
<wicri:regionArea>School of Computing, National University of Singapore, 117543, Kent Ridge</wicri:regionArea>
<orgName type="university">Université nationale de Singapour</orgName>
</affiliation>
</author>
<author>
<name sortKey="Tan, Lim" sort="Tan, Lim" uniqKey="Tan L" first="Lim" last="Tan">Lim Tan</name>
<affiliation wicri:level="4">
<country xml:lang="fr">Singapour</country>
<wicri:regionArea>School of Computing, National University of Singapore, 117543, Kent Ridge</wicri:regionArea>
<orgName type="university">Université nationale de Singapour</orgName>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2002</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">F22D67DB06336CE708F623A72F71D3ACF9D0BD3D</idno>
<idno type="DOI">10.1007/3-540-45869-7_41</idno>
<idno type="ChapterID">41</idno>
<idno type="ChapterID">Chap41</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: This paper describes a proposed document analysis system that aims at automatic indexing of digitized images of old newspaper microfilms. This is done by extracting news headlines from microfilm images. The headlines are then converted to machine readable text by OCR to serve as indices to the respective news articles. A major challenge to us is the poor image quality of the microfilm as most images are usually inadequately illuminated and considerably dirty. To overcome the problem we propose a new effective method for separating characters from noisy background since conventional threshold selection techniques are inadequate to deal with these kinds of images. A Run Length Smearing Algorithm (RLSA) is then applied to the headline extraction. Experimental results confirm the validity of the approach.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Singapour</li>
</country>
<orgName>
<li>Université nationale de Singapour</li>
</orgName>
</list>
<tree>
<country name="Singapour">
<noRegion>
<name sortKey="Liu, Hong" sort="Liu, Hong" uniqKey="Liu H" first="Hong" last="Liu">Hong Liu</name>
</noRegion>
<name sortKey="Tan, Lim" sort="Tan, Lim" uniqKey="Tan L" first="Lim" last="Tan">Lim Tan</name>
</country>
</tree>
</affiliations>
</record>
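
As a hedged illustration of how a record like the one above could be read programmatically, the sketch below pulls the title, authors and DOI out of it with Python's standard library. The record uses the `wicri:` prefix without declaring it, so a strict XML parser would reject it as written. The workaround shown here, wrapping the text in an element that binds the prefix to a placeholder URI, is an assumption made for this example. The names `WICRI_NS`, `parse_record`, `summarize` and the shortened `SAMPLE` excerpt are likewise hypothetical and not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption): the record never declares "wicri".
WICRI_NS = "urn:example:wicri"

# Shortened excerpt of the record shown above, for demonstration only.
SAMPLE = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader><fileDesc>
<titleStmt>
<title xml:lang="en">Automatic Indexing of Newspaper Microfilm Images</title>
<author><name sortKey="Liu, Hong">Hong Liu</name></author>
<author><name sortKey="Tan, Lim">Lim Tan</name></author>
</titleStmt>
<publicationStmt>
<idno type="doi">10.1007/3-540-45869-7_41</idno>
</publicationStmt>
</fileDesc></teiHeader>
</TEI>
</record>"""


def parse_record(record_xml):
    """Parse a record after binding the undeclared wicri prefix on a wrapper element."""
    wrapped = '<wrapper xmlns:wicri="%s">%s</wrapper>' % (WICRI_NS, record_xml)
    return ET.fromstring(wrapped).find("record")


def summarize(record):
    """Return the title, author names and DOI found in the TEI header."""
    title_stmt = record.find("./TEI/teiHeader/fileDesc/titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [name.text for name in title_stmt.findall("author/name")],
        "doi": record.findtext(".//publicationStmt/idno[@type='doi']"),
    }


if __name__ == "__main__":
    print(summarize(parse_record(SAMPLE)))
```

The wrapper approach only works when the input has no XML declaration of its own, which is the case for the record as displayed on this page.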

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001934 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001934 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:F22D67DB06336CE708F623A72F71D3ACF9D0BD3D
   |texte=   Automatic Indexing of Newspaper Microfilm Images
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024